Application 3

Author

Martin Hsu

More Markov Chains

Setup

Code
library(tidyverse)
library(markovchain) # for plotting state diagram
library(igraph)
library(expm)
library(viridis)
library(viridisLite)
library(ggpubr)
library(grid)
library(gridExtra)
library(prismatic) # for auto-contrast colors (black/white numbers in transition matrix heat map)
library(colourvalues) # for using with viridis
library(kableExtra)
Code
set.seed(545)
Code
compute_stationary_distribution <- function(P){

  s = nrow(P)

  # Solve pi (I - P + 1 1^T) = 1^T, which a stationary row vector
  # (pi P = pi, pi 1 = 1) satisfies; returns pi as a row vector
  rep(1, s) %*% solve(diag(s) - P + matrix(rep(1, s * s), ncol = s))

}
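
The one-line solve above relies on a standard identity (a sketch of why it works): if \(\pi\) is a stationary row vector, then \(\pi P = \pi\) and \(\pi\mathbf{1} = 1\), so

\[ \pi\left(I - P + \mathbf{1}\mathbf{1}^\top\right) = \pi - \pi P + (\pi\mathbf{1})\mathbf{1}^\top = \mathbf{1}^\top, \quad\text{hence}\quad \pi = \mathbf{1}^\top\left(I - P + \mathbf{1}\mathbf{1}^\top\right)^{-1}. \]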
Code
plot_state_diagram <- function(transition_matrix, state_names = NULL) {
  
  n_states = nrow(transition_matrix)
  
  if (is.null(state_names)) state_names = 1:n_states
  
  dtmc <- new("markovchain",
              states = as.character(state_names),
              transitionMatrix = transition_matrix,
              name = "dtmc")
  
  dtmc_igraph <- as(dtmc, "igraph")
  
  # to force limits at (0, 1)
  prob_color_scale = data.frame(prob = (0:1000) / 1000,
                                color = colour_values((0:1000) / 1000,
                                                      palette = "magma"))
  
  prob_color = data.frame(prob = round(E(dtmc_igraph)$prob, 3)) |>
    left_join(prob_color_scale, by = "prob") |>
    pull(color)
  
  E(dtmc_igraph)$color = prob_color
  
  edge_attr(dtmc_igraph, "label") <- round(E(dtmc_igraph)$prob, 2)
  
  plot(dtmc_igraph,
       layout = layout_in_circle,
       edge.curved = 0.2,
       vertex.size = 30,
       vertex.color = colour_values(V(dtmc_igraph)$name, palette = "viridis"),
       vertex.label.color = best_contrast(colour_values(V(dtmc_igraph)$name,
                                                        palette = "viridis"),
                                          c("white", "black")),
       edge.color = "grey",
       #     edge.label.color = prob_color,
       edge.label.color = "black",
       edge.arrow.size = 0.5)
  
}

Part 1

This part concerns this Riddler Classic problem from FiveThirtyEight. Note: you can find solutions here, but you should try the problem on your own without looking at solutions.

You roll a fair six-sided die 6 times. Whatever the results, you paint those on the sides of a blank die. So, if your rolls were 4, 5, 2, 6, 1, 1, then your new die has no 3’s and two 1’s. Then, you repeat the process with your new die — roll it 6 times and paint the results on a blank die. Eventually, you’ll roll the same number 6 times, at which point the process stops. Let \(T\) = the number of rounds (of 6 rolls each) that you perform until stopping. (If your 6 rolls in the first round all result in the same number, then you stop with \(T=1\).)

1)

Code and run a simulation to approximate the distribution of \(T\) and its expected value, without using Markov chains. Summarize the approximate distribution in an appropriate plot, and describe the distribution in 2-3 sentences (e.g., compute and interpret a few percentiles).

We can code a simulation that uses iteration rather than Markov chains. For a single repetition, we keep “rolling” until all six results are the same number, incrementing a counter each round to record \(T\).

Code
# Start with a fair 6-sided die
dice <- 1:6
# Initialize T to 0
`T` <- 0
# Setup a table to track our progress
dice_summary <- c(dice)

# Repeat reroll process until all numbers are the same (aka variance 0)
while (var(dice) != 0) {
  # Fill in new dice
  dice <- sample(dice, 6, replace=TRUE)
  # Increment T
  `T` <- `T` + 1
  # Update tracking table
  dice_summary <- rbind(dice_summary, c(dice))
}

# Format and print table
colnames(dice_summary) <- c("side1", "side2", "side3",
                            "side4", "side5", "side6")
rownames(dice_summary) <- 0:`T`
as.data.frame(dice_summary)

Here, we can see it took \(T=\) 4 tries to get a sequence of all 2’s.

We can functionalize this repetition and repeat it a large number of times to get an approximate distribution for \(T\).

Code
reroll <- function() {
  dice <- 1:6
  `T` <- 0
  
  while (var(dice) != 0) {
    dice <- sample(dice, 6, replace=TRUE)
    `T` <- `T` + 1
  }
  
  return(`T`)
}
Code
reps <- 10000
`T` <- numeric(reps)
for (i in 1:reps) {
  `T`[i] <- reroll()
}
head(`T`)
[1]  9  8  6  4 14  6
Code
hist(`T`, probability = TRUE)

Code
quantile(`T`) %>%
  as.matrix %>%
  t() %>%
  as.data.frame() %>%
  mutate(mean = mean(`T`), sd = sd(`T`))

This approximate distribution is right-skewed with a range of [1, 53]. The mean is about 9.63, the standard deviation is about 5.84, and the median is 8.

2)

Define a Markov chain that will help you find \(E(T)\). Be sure to clearly define the state space. (Note: there are several ways to do this; any one is fine just be clear in your choice.)

There are many ways to define a state space for \(T\). We can define one on:

  • Every combination or permutation of numbers on a die: \(\{111111,111112,111113,...,666666\}\)
  • Every unique set of numbers on a die: \(\{1,2,3,4,5,6,12,13,...,23456,123456\}\)
  • The number of unique sides on a die: \(\{1,2,3,4,5,6\}\)
  • Etc.

The best state space will minimize the number of states while still preserving the necessary information. A good choice is therefore a variation on the “every combination” approach. In this approach:

  • Each state is an indexed list of 6 numbers.
  • The value at index \(i\) would represent the count of sides that contain the number \(i\).
  • Each value at \(i\in\{1,...,6\}\) would therefore be an integer in the range \([0,6]\).
  • The sum of the values of the list must be 6.

For example:

  • \(1,1,1,1,1,1\) would mean that all numbers 1 through 6 are equally represented on the die.
  • \(2,3,0,0,0,1\) would mean that there are two sides labeled 1, three sides labeled 2, and one side labeled 6 on the die.

The full state space would look like the following, with each row being an individual state:

Code
state_space <- expand.grid(0:6, 0:6, 0:6,
            0:6, 0:6, 0:6)
colnames(state_space) <- 1:6
state_space <- state_space %>%
  filter(rowSums(across(everything())) == 6)
state_space
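
As a sanity check on the enumeration, the number of states should match the stars-and-bars count of nonnegative integer solutions to \(x_1+\cdots+x_6=6\), which is \(\binom{6+6-1}{6-1}=\binom{11}{5}=462\). A base-R sketch, independent of the tidyverse pipeline above:

```r
# Enumerate all 6-tuples over 0:6 and keep those summing to 6
grid <- expand.grid(rep(list(0:6), 6))
n_states <- sum(rowSums(grid) == 6)
c(n_states, choose(11, 5))  # both should be 462
```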

3)

Determine the transition matrix for your Markov chain. You might want to compute a few of the transition probabilities by hand, but you’ll probably need to write code to fill in the whole matrix.

To find the transition probabilities, firstly we need to know which ones will be populated or not. We know the following:

  • If the current die/state is not labeled with number \(i\) on any side, then the next die cannot have number \(i\). On the other hand, if the current die has number \(i\), then the next die may or may not have number \(i\).
  • In other words, the nonzero indices of the next state must be a subset of the nonzero indices of the current state.
  • Therefore, we can check this condition first, and then populate the matrix with the multinomial probability that we roll the next state with the die from the current state.

For example, say our current state is \(2,1,0,0,3,0\). This means that our current dice has two sides that are labeled 1, one side that is labeled 2, and three sides that are labeled 5.

  • Thus, the next state must have sides labeled some combination of \(\{1,2,5\}\).
  • The state \(0,2,1,1,1,1\) is therefore not a possible next state because it corresponds to the labels \(\{2,3,4,5,6\}\). Sides cannot be labeled \(3,4,6\) since those labels do not appear on a die with only labels \(\{1,2,5\}\). The transition probability \(p(2,1,0,0,3,0\to0,2,1,1,1,1)\) therefore equals 0.
  • The state \(3,0,0,0,3,0\) is a possible next state because this corresponds to a die labeled with sides 1 and 5, which can come from the previous die since it has labels \(\{1,2,5\}\), which includes 1 and 5.
  • We can then calculate the multinomial transition probability \(p(2,1,0,0,3,0\to3,0,0,0,3,0)\) as follows:

\[ p(2,1,0,0,3,0\to3,0,0,0,3,0)=\begin{pmatrix}6\\3\ 0\ 3\end{pmatrix} (\frac{2}{6})^3(\frac{1}{6})^0(\frac{3}{6})^3 \]

Code
dmultinom(c(3,0,3), prob=c(2,1,3)/6)
[1] 0.09259259

\[ =0.0926 \]

  • Notice how the current state determines the category probabilities of the multinomial distribution, while the next state supplies the counts at which it is evaluated.
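
Conveniently, `dmultinom` already returns 0 whenever a count is positive for a category whose probability is 0, so it agrees with the subset check above. Using the same example states:

```r
# Impossible next state: it uses labels 3, 4, and 6, which are absent
# from the current die (2,1,0,0,3,0), so the probability is exactly 0
p_impossible <- dmultinom(c(0, 2, 1, 1, 1, 1), prob = c(2, 1, 0, 0, 3, 0) / 6)
p_impossible
# [1] 0

# The possible transition computed earlier, now written as a full 6-vector
p_possible <- dmultinom(c(3, 0, 0, 0, 3, 0), prob = c(2, 1, 0, 0, 3, 0) / 6)
p_possible
# [1] 0.09259259
```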

Let’s apply this logic to the entire transition matrix.

Code
# Initialize P
P <- matrix(nrow = nrow(state_space), ncol = nrow(state_space))

# Define state names for P
state_names <- state_space %>%
  unite(concat, everything(), sep="") %>%
  pull
Code
# Populate P
for (i in 1:nrow(state_space)) {
  # Current state
  x_current <- unlist(state_space[i,])
  # Valid die labels for next state
  x_current_check <- which(x_current != 0)
  for (j in 1:nrow(state_space)) {
    # Next state
    x_next <- unlist(state_space[j,])
    # Next state die labels
    x_next_check <- which(x_next != 0)
    
    # Check to see if next die labels are subset of current
    if (all(x_next_check %in% x_current_check)) {
      # If so, calculate multinomial probability
      P[i, j] <- dmultinom(x_next, prob = x_current / 6)
    } else {
      # Otherwise, fill with 0
      P[i, j] <- 0
    }
  }
}
Code
colnames(P) <- state_names
rownames(P) <- state_names
as.data.frame(P)

4)

Use the transition matrix and tools from class to solve for \(E(T)\). Compare to the simulated value from part 1.

We can now use our transition matrix and treat this as a mean time to absorption (MTTA) problem. Notice how the rows and columns for the states \(\{600000,060000,...,000006\}\) form an identity submatrix. This is because they are absorbing states.

Code
absorb_states <- c("600000", "060000", "006000", "000600", "000060", "000006")
as.data.frame(P[absorb_states, absorb_states])

Since we start with a fair six-sided die, we can interpret this as looking for the MTTA \(\mu_{111111}\). We can find this using computation.

Code
Q <- P[!(rownames(P) %in% absorb_states),
       !(colnames(P) %in% absorb_states)]
I <- diag(nrow = nrow(Q))

mtta <- solve(I - Q, rep(1, nrow(Q)))
as.data.frame(mtta)
Code
mtta["111111"]
  111111 
9.655991 

Therefore, \(E[T]=\mu_{111111}=9.656\) rounds. This is close to the simulated value from part 1.
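
As a quick check that `solve(I - Q, rep(1, nrow(Q)))` computes MTTAs correctly, we can apply it to a toy chain with a known answer (not part of the dice problem): flip a fair coin until the first head. The single transient state has \(p(T,T)=1/2\), and the expected number of flips is 2.

```r
# One transient state with self-transition probability 1/2
Q_toy <- matrix(1/2, nrow = 1, ncol = 1)
mu <- solve(diag(1) - Q_toy, rep(1, 1))
mu  # expected time to absorption = 2
```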

5)

Use the transition matrix and tools from class to solve for the distribution of \(T\), and display the distribution in an appropriate plot. Compare to the simulated distribution from part 1.

We know that

\[ P(T\le n|X_0=i)=\sum_{\text{absorbing }a}p^{(n)}(i,a),\ n=1,2,... \]

Therefore, for each n, we can find \(P(T\le n)\) (the condition \(X_0=111111\) is implied) by summing over the transition probabilities from state \(111111\) to the absorbing states. For instance, if \(n=8\), we can calculate the probability below.

Code
sum((P %^% 8)["111111", absorb_states])
[1] 0.5256652

The probability \(P(T=n)\) can therefore be found by the following property for discrete distributions:

\[ P(T=n)=P(T\le n)-P(T<n)=P(T\le n)-P(T \le n-1) \]

Therefore, following our example, we calculate \(P(T=8)\) as seen below.

Code
sum((P %^% 8)["111111", absorb_states]) - sum((P %^% 7)["111111", absorb_states])
[1] 0.08541176

We can once more functionalize this process to get a full distribution.

Code
prob_T <- function(P, n, x0, absorb_states) {
  if (n == 0) return(0)
  # P(T <= n) = sum of n-step transition probabilities into absorbing states
  p1 <- sum((P %^% n)[x0, absorb_states])
  if (n == 1) return(p1)
  # P(T = n) = P(T <= n) - P(T <= n - 1)
  p2 <- sum((P %^% (n - 1))[x0, absorb_states])
  return(p1 - p2)
}
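
To sanity-check this logic, we can apply the same computation to a toy absorbing chain whose distribution of \(T\) is known exactly (a coin-flip chain, where \(T\) is geometric; this snippet defines its own base-R matrix power so it stands alone without `expm`):

```r
# Toy chain: "T" is transient with p(T,T) = 1/2; "H" is absorbing.
# Time to absorption is geometric, so P(T = n) = (1/2)^n.
P_toy <- matrix(c(0.5, 0.5,
                  0.0, 1.0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("T", "H"), c("T", "H")))

# Base-R matrix power (repeated %*% preserves the dimnames)
mat_pow <- function(M, k) Reduce(`%*%`, replicate(k, M, simplify = FALSE))

p_le <- function(n) mat_pow(P_toy, n)["T", "H"]  # P(T <= n)
pmf <- sapply(1:4, function(n) p_le(n) - if (n > 1) p_le(n - 1) else 0)
pmf
# [1] 0.5000 0.2500 0.1250 0.0625
```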
Code
T_vals <- 0:60
T_dist <- numeric(length(T_vals))
for (i in 1:length(T_vals)) {
  T_dist[i] <- prob_T(P, T_vals[i], "111111", absorb_states)
}

cbind(T_vals, T_dist) %>%
  as.data.frame() %>%
  rename(n = "T_vals", `P(T=n)`="T_dist")
Code
plot(T_vals, T_dist, type="l", main="Distribution of T", 
     ylab = "Density, P(T=n)", xlab = "n")

We can overlay this plot onto our simulated values to see how closely they match.

Code
hist(`T`, probability = TRUE, ylim = c(0, 0.1))
lines(T_vals, T_dist, col="red")
points(T_vals, T_dist, col="red")

As seen in the plot, the distribution created from the transition matrix closely matches that of the simulated distribution.

Part 2

In this part you will create and solve your own problems involving Markov chains. You have lots of flexibility, but your problems should include at least one part that requires

  • A simulation
  • An analytical solution that uses either first step analysis/absorption
  • An analytical solution that uses stationary distributions/long run behavior

Problem

A dataset I like to work with that is actually updated every few minutes is the USGS earthquakes dataset, which lists earthquakes from the past 30 days.

Code
cat("Data Pull Datetime:", as.character(Sys.time()), Sys.timezone())
Data Pull Datetime: 2024-02-03 00:31:30.891321 America/Los_Angeles
Code
quakes <- read_csv("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv")
head(quakes)

We can use the magnitudes in this earthquakes dataset to explore several questions, treating the (floor-rounded) magnitudes as a Markov chain.

Code
mags <- floor(quakes$mag)
mags <- mags[mags >= 0]
head(mags, 20)
 [1] 2 3 3 1 1 1 2 1 3 2 2 2 1 2 1 1 2 4 2 1

Using this data, we can answer the following questions.

  1. Construct a transition matrix for the magnitudes. Is this matrix irreducible?

  2. Find the stationary distribution for the matrix, if it exists. Compare it to the distribution of magnitudes in the data.

  3. Modify the transition matrix to one that can help us find the number of earthquakes until the next 5+ magnitude earthquake, \(T\).

  4. Using the new transition matrix in part 3, find the mean time to absorption for each magnitude - that is, the mean number of earthquakes until the next 5+ magnitude earthquake, starting from each magnitude-4-or-lower state.

  5. Using the transition matrix, simulate chains starting with the same magnitude as the observed data. Find \(E[T]\) and plot the distribution of T. Then overlay an exponential distribution with parameter equal to \(E[T]\). What do you notice?

  6. Can this chain reasonably be considered a first-order Markov chain? Evaluate using both an appropriate visual and a formal test.

  7. Make another visual to explore if occurrences are independent. Connect the results back to previous findings. Is the Markov property necessary to model this behavior?

1)

Construct a transition matrix for the magnitudes. Is this matrix irreducible?

We can use the markovchain package to get maximum likelihood estimates for each transition probability in the matrix.

Code
mc_fit <- markovchainFit(mags, method = "mle")
Code
state_names <- mc_fit$estimate@states
P <- mc_fit$estimate@transitionMatrix
as.data.frame(P)

This matrix appears to be irreducible: every observed state can reach every other observed state (possibly over several steps), so all states communicate and are recurrent. This becomes more apparent when viewing a state diagram.

Code
plot_state_diagram(P, state_names)

2)

In order to find the stationary distribution for the matrix, we can set up a system of equations to solve for it.

\[ \begin{matrix} \pi(0)=p(0,0)\pi(0)+p(1,0)\pi(1)+\cdots+p(10,0)\pi(10)\\ \pi(1)=p(0,1)\pi(0)+p(1,1)\pi(1)+\cdots+p(10,1)\pi(10)\\ \vdots\\ \pi(10)=p(0,10)\pi(0)+p(1,10)\pi(1)+\cdots+p(10,10)\pi(10)\\ \pi(0)+\pi(1)+\cdots+\pi(10)=1 \end{matrix} \]

We can quickly solve this by computing in R with matrix operations.

Code
compute_stationary_distribution(P) %>%
  as.data.frame()

Comparing to the distribution of magnitudes in the observed data:

Code
prop.table(table(mags)) %>%
  as.matrix() %>%
  t() %>%
  as.data.frame()

The stationary distribution is very similar to the empirical proportions in the data. This is consistent with the theory that, in the long run, the proportion of time an irreducible Markov chain spends in each state matches the stationary distribution.
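
As a quick check of the `compute_stationary_distribution` approach itself, we can verify \(\pi P = \pi\) on a toy two-state chain with a known stationary distribution (toy values, not the earthquake data):

```r
# Two-state chain with known stationary distribution (0.8, 0.2)
P_toy <- matrix(c(0.9, 0.1,
                  0.4, 0.6), nrow = 2, byrow = TRUE)
s <- nrow(P_toy)
# Same computation as compute_stationary_distribution()
pi_hat <- rep(1, s) %*% solve(diag(s) - P_toy + matrix(1, s, s))
stopifnot(all(abs(pi_hat %*% P_toy - pi_hat) < 1e-12))  # pi P = pi
pi_hat  # (0.8, 0.2)
```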

3)

Modify the transition matrix to one that can help us find the number of earthquakes until the next 5+ magnitude earthquake, \(T\).

We can modify this matrix to help us find the number of earthquakes until the next 5+ magnitude earthquake by turning magnitudes/states 5, 6, 7, 8, 9, and 10 into absorbing states. We do this by replacing each of those rows with the corresponding row of the identity matrix, as seen in the code below.

Code
max_mag <- max(mags)
P_absorb <- P
for (state in as.character(5:max_mag)) {
  if (state %in% colnames(P_absorb)){
    new_dist <- setNames(rep(0, max_mag+1), 0:max_mag)
    new_dist[state] <- 1
    P_absorb[state,] <- new_dist
  }
}
as.data.frame(P_absorb)

We can observe these changes in the state diagram:

Code
plot_state_diagram(P_absorb, state_names)

Notice how vertices 5, 6, and 7 in the state diagram have an outdegree of 0.

4)

Using the new transition matrix in part 3, find the mean time to absorption for each magnitude - that is, the mean number of earthquakes until the next 5+ magnitude earthquake, starting from each magnitude-4-or-lower state.

We can set up the following system of equations to solve for the mean steps until we reach an absorbing state, or the mean time to absorption (MTTA).

\[ \begin{matrix} \mu_0=1+p(0,0)\mu_0+p(0,1)\mu_1+\cdots+p(0,4)\mu_4+p(0,5)(0)+\cdots+p(0,10)(0) \\ \mu_1=1+p(1,0)\mu_0+p(1,1)\mu_1+\cdots+p(1,4)\mu_4+p(1,5)(0)+\cdots+p(1,10)(0) \\ \vdots\\ \mu_4=1+p(4,0)\mu_0+p(4,1)\mu_1+\cdots+p(4,4)\mu_4+p(4,5)(0)+\cdots+p(4,10)(0) \\ \end{matrix} \]

We can solve this using computation in R as well, by condensing this to the following matrix algebra relation.

\[ \boldsymbol{\mu} = \boldsymbol{1} + \mathbf{Q}\boldsymbol{\mu} \]

Where:

  • \(\mu\) is the vector of MTTAs
  • \(\textbf{Q}\) is the transition submatrix of transient states to transient states
Code
Q <- P_absorb[as.character(0:4), as.character(0:4)]
I <- diag(nrow = 5)

solve(I - Q, rep(1, 5)) %>%
  as.matrix() %>%
  t() %>%
  as.data.frame()

Interestingly, we find that for every transient magnitude/state in the chain, it takes on average 85-86 earthquakes to get to an absorbing magnitude/state. This makes sense given the structure of the graph, as generally the transient states are fully connected to each other in a clique.

5)

Using the transition matrix, simulate chains starting with the same magnitude as the observed data. Find \(E[T]\) and plot the distribution of T. Then overlay an exponential distribution with parameter equal to \(E[T]\). What do you notice?

Let’s start out with a single chain. First, let’s enumerate our starting distribution, \(\pi_0\). We want the chain to always start in the same state as the observed data, so the starting distribution places probability 1 on that state.

Code
X0 <- mags[1]
pi_0 <- setNames(rep(0, max_mag+1), 0:max_mag)
pi_0[as.character(X0)] <- 1
pi_0 %>%
  as.matrix() %>%
  t() %>%
  as.data.frame()

Next, let’s generate states until we see a magnitude of 5 or above.

Code
# Sample from the starting distribution
state <- sample(state_names, 1, prob = pi_0)
# Initialize index
i <- 2

# Initialize chain
path <- character()
# Set starting state
path[[1]] <- state
# Keep generating states until we encounter absorbing state
while (!(state %in% 5:10)) {
  state <- sample(state_names, 1, prob = P[as.character(path[[i - 1]]),])
  path[[i]] <- state
  i = i + 1
}

cat(path)
2 2 1 0 1 0 0 1 0 1 0 1 1 1 1 1 2 1 1 1 1 0 1 0 1 2 1 1 2 1 4 1 2 0 4 4 1 2 2 1 1 1 0 0 2 2 1 1 4 1 1 2 2 2 1 2 1 1 3 1 0 2 0 1 0 1 4 1 1 2 4 1 0 1 3 2 1 5

Let’s functionalize this process to generate a large number of repetitions.

Code
sim_quakes <- function(P, pi_0, state_names, absorb) {
  state <- sample(state_names, 1, prob = pi_0)
  i <- 2
  
  path <- character()
  path[[1]] <- state
  while (!(state %in% absorb)) {
    state <- sample(state_names, 1, prob = P[as.character(path[[i - 1]]),])
    path[[i]] <- state
    i <- i + 1
  }
  
  return(path)
}
Code
reps <- 10000
`T` <- numeric(reps)
for (i in 1:reps) {
  `T`[[i]] <- length(sim_quakes(P, pi_0, state_names, 5:10)) - 1
}

Here are some sample values of \(T\):

Code
cat(head(`T`))
129 22 112 352 9 17

Here is a simulated distribution of \(T\). Overlaid is the theoretical distribution, calculated using the following properties:

\[ P(T\le n|X_0=i)=\sum_{\text{absorbing }a}p^{(n)}(i,a),\ n=1,2,... \]

\[ P(T=n)=P(T\le n)-P(T<n)=P(T\le n)-P(T \le n-1) \]

Code
T_vals <- 1:800
T_dist <- numeric(length(T_vals))
for (i in 1:length(T_vals)) {
  T_dist[i] <- prob_T(P_absorb, T_vals[i], X0, 5:max_mag)
}

hist(`T`, prob=TRUE)
lines(T_vals, T_dist, col="blue")

Here is the mean:

Code
cat(mean(`T`))
86.4845

When we overlay the exponential distribution, we find that it closely follows the distribution of \(T\).

Code
hist(`T`, prob=TRUE)
lines(0:800, dexp(0:800, 1/mean(`T`)), col="red")

6)

Can this chain reasonably be considered a first-order Markov chain? Evaluate using both an appropriate visual and a formal test.

We can first construct a 3-dimensional crosstab of the previous, current, and next states, and use it to plot stacked bar charts. If, within each facet (current state), the bars look the same across previous states, then the sequence is consistent with the Markov property. If not, then the sequence likely does not satisfy it.

Code
x_summary <- table(lead(mags, 1), mags, lag(mags, 1)) %>%
  as.data.frame() %>%
  # Var1 comes from lead() (the next state) and Var3 from lag() (the previous)
  rename(x_next = "Var1", x_current = "mags",
         x_previous = "Var3", count = "Freq")

x_summary %>%
  ggplot(aes(x = x_previous, y = count, fill = x_next)) +
  geom_bar(position = "fill",
           stat = "identity") +
  scale_fill_viridis_d() +
  labs(y = "Conditional probability given previous state") +
  facet_wrap(~x_current, labeller = label_both)

From the visual above, it looks unlikely that this chain actually follows the Markov property. Following up with a formal test:

\(H_0\): The sequence is generated from a first-order Markov chain.

\(H_A\): The null hypothesis is false.

Code
verifyMarkovProperty(mags)
Testing markovianity property on given data sequence
Chi - square statistic is: 451.4307 
Degrees of freedom are: 165 
And corresponding p-value is: 0 

Because of the extremely low p-value, we have enough evidence to conclude that the sequence of earthquake magnitudes is not plausibly generated by a first-order Markov chain.

7)

Make another visual to explore if occurrences are independent. Connect the results back to previous findings. Is the Markov property necessary to model this behavior?

We can repeat the analysis above, but this time inspecting only the distribution of the next state conditional on the current state.

Code
x_summary <- table(mags, lag(mags, 1)) %>%
  as.data.frame() %>%
  # mags is the later value in each pair (the next state); Var2 comes
  # from lag() and is the earlier value (the current state)
  rename(x_next = "mags",
         x_current = "Var2", count = "Freq")

x_summary %>%
  ggplot(aes(x = x_current, y = count, fill = x_next)) +
  geom_bar(position = "fill",
           stat = "identity") +
  scale_fill_viridis_d() +
  labs(y = "Conditional probability given current state")

Here, we can see that the conditional distributions are more stable. It is more plausible that the different states in the sequence are fully independent of each other rather than demonstrating the conditional independence of the Markov property.

Tying this back to part 5, we notice that the exponential distribution is a good fit for \(T\). The exponential distribution describes the waiting time between events that are independent and homogeneously distributed through time. This may have been a good hint that the events are independent rather than exhibiting the conditional dependence of the Markov property. As a result, a Markov chain might not be necessary, or the best model, for this data context.
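
One way to see why the exponential curve fits so well: \(T\) is a discrete count, so under independent trials with success probability \(p\approx1/E[T]\) its exact waiting-time distribution is geometric, and a geometric distribution with small \(p\) is closely approximated by an exponential with rate \(-\log(1-p)\approx p\). A small numeric illustration (toy value \(p = 1/86\), roughly matching the simulated mean above):

```r
# Compare the geometric pmf to the matching exponential density;
# the maximum pointwise gap is tiny for small p
p <- 1 / 86
n <- 0:500
max_diff <- max(abs(dgeom(n, p) - dexp(n, rate = -log(1 - p))))
max_diff
```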